ABSTRACT 

The Integrated Microbial Genomes (IMG) data ware- 
house integrates genomes from all three domains of 
life, as well as plasmids, viruses and genome frag- 
ments. IMG provides tools for analyzing and review- 
ing the structural and functional annotations of 
genomes in a comparative context. IMG's data 
content and analytical capabilities have increased 
continuously since its first version released in 
2005. Since the last report published in the 2012 
NAR Database Issue, IMG's annotation and data in- 
tegration pipelines have evolved while new tools 
have been added for recording and analyzing 
single cell genomes, RNA Seq and biosynthetic 
cluster data. Different IMG datamarts provide 
support for the analysis of publicly available 
genomes (IMG/W: http://img.jgi.doe.gov/w), expert 
review of genome annotations (IMG/ER: 
http://img.jgi.doe.gov/er) and teaching and training in the area 
of microbial genome analysis (IMG/EDU: 
http://img.jgi.doe.gov/edu). 

DATA SOURCES AND PROCESSING 

The Integrated Microbial Genomes (IMG) system inte- 
grates genomes from all three domains of life, as well as 
viruses, plasmids and genome fragments (partial 
sequences of genomic regions of interest, such as biosyn- 
thetic clusters). Until 2012, IMG used NCBI's RefSeq 
resource (1) as its main source of public genome 
sequence data and annotations consisting of predicted 
genes and protein products, with a RefSeq-specific 
pipeline used for retrieving new genomes from RefSeq's 
ftp site. For non-public (i.e. 'private') datasets, the IMG 
ER Submission system allowed scientists to select their 
sequencing projects in GOLD (2) and then submit their 
genome sequence data for annotation and integration into 
the 'Expert Review' version of IMG, IMG/ER (http:// 
img.jgi.doe.gov/er). Public and private genomes were pro- 
cessed using different annotation and data integration 
pipelines, and recorded in different databases. 

In an effort to improve the efficiency of data processing 
and tracking, IMG's genome submission, annotation and 
integration pipelines were consolidated in November 2012. 
The IMG ER Submission system (http://img.jgi.doe.gov/submit) 
and associated (submission, gene prediction, functional 
annotation and data integration) data processing 
pipelines were extended to handle both public and 
private genomes in a uniform manner. The pipelines use 
a common mechanism for tracking the processing status 
of genome datasets; GOLD provides the information 
needed for retrieving new public genomes from RefSeq 
or GenBank (3) and both public and private genomes 
are recorded in a common IMG data warehouse. 

For every genome, the IMG data warehouse records 
primary genome sequence information including its 
organization into chromosomal replicons (for finished 
genomes) and scaffolds and/or contigs (for draft 
genomes), together with predicted protein-coding 
sequences, some RNA-coding genes and protein product 
names that are provided by the genome sequence centers 
or generated by IMG's functional annotation pipeline. 

Public and private genomes submitted for annotation 
and integration by IMG's pipelines are first associated 
with sequencing projects in GOLD. Custom tools and 
metadata about the topology of contigs and scaffolds 
are used to identify the origin of replication of circular 
replicons and permute the corresponding scaffold or 
contig if necessary. To ensure accurate identification of 
partial genes bordering the gaps, gene models and other 
features are initially predicted on individual contigs and 
combined thereafter to generate scaffold-level structural 
annotation. CRISPR elements are detected using CRT 
(4) and PILER-CR (5). Predictions from both methods 
are concatenated, and in case of overlapping elements, 
the shorter one is removed. Identification of tRNAs is 
performed using tRNAscan-SE 1.23 (6). Ribosomal 
RNA genes (5S, 16S and 23S) are predicted using 
hmmsearch against the custom models generated for 
each type of rRNA in bacteria and archaea (7,8). With 
the exception of tRNA and rRNA, all models from 
Rfam (9) are used to search the genome sequence. 
Sequences are first compared with a database containing 
all the non-coding RNA genes in the Rfam database using 
BLAST, then sequences that have hits to genes belonging 
to an Rfam model are searched using the program 
INFERNAL (10). Signal peptides are computed using 
SignalP (11), whereas transmembrane helices are 
computed using TMHMM (12). Protein-coding genes 
are predicted using Prodigal (13); models overlapping 
with CRISPRs and certain types of RNAs (e.g. rRNAs) 
are removed. 
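
The CRISPR merging rule described above (concatenate the CRT and PILER-CR predictions, then drop the shorter of any two overlapping elements) can be sketched as follows. This is only an illustration, not IMG's pipeline code: the (start, end) tuple representation, the function name merge_crispr_predictions, the coordinates and the tie-breaking rule for equal-length elements are all assumptions.

```python
def merge_crispr_predictions(crt, pilercr):
    """Combine two lists of (start, end) CRISPR intervals from one contig.

    Hypothetical helper: when two elements overlap, the shorter one is dropped.
    Equal-length ties keep whichever element was listed first (an assumption).
    """
    # Visit longer elements first so that they win over shorter overlapping ones.
    candidates = sorted(crt + pilercr, key=lambda iv: iv[1] - iv[0], reverse=True)
    kept = []
    for start, end in candidates:
        overlaps = any(start <= k_end and k_start <= end for k_start, k_end in kept)
        if not overlaps:
            kept.append((start, end))
    return sorted(kept)


# Made-up coordinates for illustration only.
crt_hits = [(100, 400), (1000, 1200)]
pilercr_hits = [(150, 350), (5000, 5300)]
merged = merge_crispr_predictions(crt_hits, pilercr_hits)
print(merged)
```

Here the PILER-CR element (150, 350) lies inside the longer CRT element (100, 400) and is therefore discarded, while the non-overlapping elements from both tools are retained.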

After a new genome is processed, protein-coding genes 
are compared with protein families and the proteome of 
selected publicly available 'core' genomes, with product 
names assigned based on the results of these comparisons. 
First, protein sequences are compared with COG (14) 
using RPS-BLAST, Pfam-A (15) using HMMER 3.0b2 
executed inside Sanger's pfam_scan.pl wrapper script 
and TIGRfam (16) databases using HMMER 3.0 (8), 
and associated with KEGG Orthology (KO) terms (17) 
using USEARCH (18). Genomes in IMG are associated 
with KEGG pathways using the assignment of KO terms 
to protein-coding genes, while their association with 
MetaCyc pathways (19) is based on correlating enzyme 
EC numbers in MetaCyc reactions with EC numbers 
associated with protein-coding genes via KO terms. 
Genes are further characterized using an IMG native col- 
lection of generic (protein cluster-independent) functional 
roles called IMG terms that are defined by their associ- 
ation with generic (organism-independent) functional 
hierarchies, called IMG pathways (20). IMG terms and 
pathways are specified by domain experts at DOE-JGI 
as part of the process of annotating specific genomes of 
interest, and are subsequently propagated to all the 
genomes in IMG using a rule-based methodology. 
Transporter genes are linked to the Transport 
Classification Database (21) based on their assignment 
to COG, Pfam or TIGRfam domains or IMG terms 
that correspond to transporter families. 
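
The EC-number-based association of genomes with MetaCyc pathways described above can be illustrated with a small sketch. It assumes three lookup tables (gene to KO term, KO term to EC numbers, pathway to the EC numbers of its reactions); the table layout, the function name and all identifiers below are hypothetical placeholders rather than IMG data structures.

```python
def pathways_for_genome(gene_to_ko, ko_to_ecs, pathway_to_ecs):
    """Return {pathway: sorted EC numbers shared with the genome}."""
    # Collect the EC numbers reachable from the genome's genes via KO terms.
    genome_ecs = set()
    for ko in gene_to_ko.values():
        genome_ecs.update(ko_to_ecs.get(ko, ()))
    # A pathway is associated with the genome if any reaction EC matches.
    hits = {}
    for pathway, ecs in pathway_to_ecs.items():
        shared = genome_ecs & set(ecs)
        if shared:
            hits[pathway] = sorted(shared)
    return hits


# Placeholder identifiers, for illustration only.
gene_to_ko = {"gene_A": "KO_1", "gene_B": "KO_2"}
ko_to_ecs = {"KO_1": ["1.1.1.1"], "KO_2": ["2.7.1.2"]}
pathway_to_ecs = {
    "pathway_X": ["1.1.1.1", "4.1.1.1"],
    "pathway_Y": ["3.5.4.10"],
}
associated = pathways_for_genome(gene_to_ko, ko_to_ecs, pathway_to_ecs)
print(associated)
```

Whether a single matching EC number is sufficient to associate a pathway with a genome is an assumption of this sketch; the text does not state a minimum number of matching reactions.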

The integration of new genomes into IMG involves 
computing protein sequence similarities between their 
genes and genes of all other (new or existing) genomes 
in the system, assigning IMG terms and protein product 
names to the genes of the new genomes, identifying fusions 
and computing conserved gene cassettes (putative 
operons). For each gene, IMG provides lists of related 
(e.g. homolog, paralog and ortholog) genes that are 
based on sequence similarities computed using 
USEARCH for protein-coding and RNA genes. A fused 
gene (fusion) is defined as a gene that is formed from the 
composition (fusion) of two or more previously separate 
genes (22). Fusions are identified based on computing 
USEARCH similarities between genes. Only genes from 
finished genomes are considered as putative components 
to avoid false predictions from fragmented genes in draft 
genomes. Furthermore, genes that frequently appear as 
fragmented in finished genomes, such as 'transposases' 
and 'integrases', as well as 'pseudogenes' are excluded 
from fusion calculations. Putative horizontally transferred 
genes are identified from the sequence similarity data. The 
phylogenetic distribution of best hits against a set of ref- 
erence isolate genomes also provides additional informa- 
tion on possible horizontal gene transfers for isolates. 
A 'chromosomal cassette' is defined as a stretch of genes 
with intergenic distance <300 bp, whereby the genes can 
be on the same or different strands of the chromosome. 
Chromosomal cassettes with a minimum size of two genes 
common in at least two separate genomes are defined as 
'conserved chromosomal cassettes'. The identification of 
common genes across organisms is based on two gene 
clustering methods, namely, participation in COG and 
Pfam clusters (23). 
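
The chromosomal cassette definition above (neighboring genes separated by <300 bp, irrespective of strand) can be sketched as a single pass over genes sorted by position. The (gene_id, start, end) tuple layout, the function name and the coordinates are assumptions made for illustration; the gap is approximated as the start of a gene minus the largest end coordinate seen so far in the current cassette, and stretches of fewer than two genes are not reported.

```python
def chromosomal_cassettes(genes, max_gap=300, min_size=2):
    """Group genes on one scaffold into cassettes.

    genes: list of (gene_id, start, end); strand is deliberately ignored.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    cassettes, current, prev_end = [], [], None
    for gene_id, start, end in ordered:
        if current and start - prev_end < max_gap:
            # Close enough to the previous gene: extend the current cassette.
            current.append(gene_id)
            prev_end = max(prev_end, end)
        else:
            # Gap too large: close the current cassette and start a new one.
            if len(current) >= min_size:
                cassettes.append(current)
            current, prev_end = [gene_id], end
    if len(current) >= min_size:
        cassettes.append(current)
    return cassettes


# Made-up coordinates for illustration only.
genes = [
    ("g1", 1, 900), ("g2", 950, 1800), ("g3", 2050, 3000),
    ("g4", 4000, 4500), ("g5", 4600, 5200),
]
cassettes = chromosomal_cassettes(genes)
print(cassettes)
```

Identifying conserved cassettes would additionally require comparing the COG or Pfam assignments of the genes in each cassette across genomes, which is outside the scope of this sketch.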

Note that for public and private genomes that are 
already associated with genes and/or protein product 
names, the native gene and/or product names are 
preserved in IMG unless their replacement is explicitly 
requested at the time they are submitted for annotation 
and integration into IMG. 

DATA CONTENT 

Genomics data 

The content of IMG has grown steadily since the first 
version released in March 2005, with the current version 
of IMG (as on 10 September 2013) containing 11 568 bac- 
terial, archaeal and eukaryotic genomes, an increase of 
>300% since August 2011 (24). IMG also includes 2848 
viral genomes, 1198 plasmids that did not come from a 
specific microbial genome sequencing project and 581 
genome fragments, bringing its total content to 16 195 
genome datasets with >42 million protein-coding genes. 

The number of single cell genomes included into IMG 
has increased substantially: there are 1341 single cell 
genomes in the current version of IMG compared with 
only 21 in August 2011. Approximately 240 single cell 
genomes are part of the Microbial Dark Matter project 
that aims to expand the Genomic Encyclopedia of 
Bacteria and Archaea by targeting 100 single cell repre- 
sentatives of uncultured candidate phyla (25). 

IMG has 13 342 genome datasets that are publicly available 
to all users without restrictions via the IMG/W 
datamart (http://img.jgi.doe.gov/w). Genomes that have 
not yet been published (also known as 'private') are 
password-protected and available only to the scientists 
who study ('own') them through the IMG/ER ('Expert 
Review') datamart (http://img.jgi.doe.gov/er). Private 
genomes are usually publicly released 6 months after the 
dataset becomes available in IMG. 

IMG/ER allows individual scientists or groups of scien- 
tists to review and curate the functional annotation of 
microbial genomes in the context of IMG's public 
genomes (26). Since August 2011, hundreds of private 
genomes have been reviewed and curated using IMG/ 
ER, a relatively small fraction of the 9000 genomes that 
were processed by IMG's data annotation and integration 
pipelines, as genome curation is a time-consuming process. 
Genome curation is usually carried out for identifying 
missing genes or for correcting functional annotations, 
e.g. as part of the process of curating IMG native terms 
and pathways. 

Omics data 

Proteomics datasets have been gradually included into 
IMG starting in 2009. Since August 2011, 64 new 
protein expression datasets (samples) that are part of 
two studies were included into IMG, bringing the total 
to 90 samples across five studies. The organization and 
analysis of proteomic data in IMG are discussed in (24). 

The first RNA-Seq (transcriptomic) datasets, included 
in IMG in 2011, are part of the Synechococcus PCC 
study consisting of ~40 samples (Billis,K., Billini,M., 
Kyrpides,N.C. and Mavromatis,K., submitted for 
publication). As of August 2013, IMG contains 99 
samples across 10 RNA-Seq studies. A typical RNA-Seq 
study involves the sequencing of cDNA from a genome 
under different experimental conditions, with the effect of 
each experimental condition being captured by a sample. 
As part of RNA-Seq sequencing analysis, reads are 
mapped to the reference genome involved in the study, 
and the expressed genes in each sample are recorded 
with their observed read counts, mean, median and 
strand. RNA reads are mapped to reference genomes 
using Bowtie2 (27). The scope of mapping is determined 
by the type of cDNA sample (sscDNA/dscDNA) and the 
directionality of the libraries, whereby reads may map to a 
single strand or both strands of the reference sequence. 
Expression levels are normalized using RPKM 
(reads per kilobase per million), Quantile or Affine 
transformations and may need to be interpreted based on the 
type of cDNA in the sample. For genomes involved in 
RNA-Seq studies, the experiments/samples are recorded 
in IMG together with experimental conditions, and 
the read counts are organized per expressed gene, as 
illustrated in Figure 1. 
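
As a worked example of the RPKM normalization mentioned above, the sketch below uses the commonly used definition (reads mapped to a gene, per kilobase of gene length, per million mapped reads in the sample). The function name and the numbers are illustrative assumptions; the exact implementation used by IMG is not described in the text.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    # Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Made-up values: 500 reads on a 1000 bp gene in a sample of 10 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=1000, total_mapped_reads=10_000_000)
print(value)
```

Dividing by gene length makes genes of different sizes comparable within a sample, while dividing by the total number of mapped reads makes samples of different sequencing depth comparable with each other.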

Figure 1. RNA-Seq data organization. (i) 'Omics' datasets can be accessed from 'IMG Statistics' on IMG's front page, following the 
Experiments link available on the 'IMG Statistics' page. (ii) An RNA-Seq study is associated with samples and the number of genes expressed across 
all samples. (iii) Each sample is associated with the number of expressed genes, the total number of reads and the average number of reads per gene. 
(iv) An expressed gene is associated with a read count (total number of reads divided by the size of the gene) and normalized coverage (coverage for 
a gene in the experiment divided by the total number of reads in that experiment). 

Biosynthetic clusters 

IMG contains biosynthetic clusters of genes associated 
with pathways involved in the generation of secondary me- 
tabolites in isolate prokaryotic genomes. Experimentally 
validated biosynthetic clusters were identified by searching 
NCBI's nucleotide database for genome fragments 
(partially sequenced genomes) containing gene clusters 
associated with secondary metabolites/natural products 
(28). Additional biosynthetic clusters were predicted 
using ClusterFinder (Fischbach, submitted for publica- 
tion). Biosynthetic clusters in IMG are associated with 
IMG, Metacyc and KEGG pathways as well as informa- 
tion available in GOLD on their natural products. 

Genomes associated with biosynthetic clusters can be 
examined as illustrated in Figure 2, where these genomes 
are listed in descending order of the number of biosyn- 
thetic clusters present in them. Alternatively, IMG can be 
used to find genomes associated with natural products 
associated with genome fragments but not with biosyn- 
thetic clusters, as illustrated in Figure 2(v). Natural 
products are small metabolites found in nature, and 
although the biosynthetic clusters that generate some 
natural products have been identified, 
there are still natural products whose production mechanisms 
in prokaryotes remain unknown. 

ANALYSIS TOOLS 

Browsers and search tools allow finding and selecting 
genomes, genes and functions of interest, which can then 
be examined individually or analyzed in a comparative 
context. Gene content-based comparison of genomes is 
provided by the 'Phylogenetic Profiler' and the 
'Phylogenetic Profiler for Gene Cassettes' tools that 
allow identifying genes in a query genome in terms of 
presence or absence of homologs in other genomes, or 
participation in conserved gene cassettes across other 
genomes (29,30). Function-based comparison of 
genomes is provided by the 'Abundance Profile 
Overview' and 'Function Profile' tools that allow 
comparing the relative abundance of protein families 
(COGs, Pfams, TIGRfams) and functional families 
(enzymes) across genomes. The composition of analysis 
operations is facilitated by genome, scaffold, gene and 
function 'carts' that handle lists of genomes, scaffolds, 
genes and functions, respectively. IMG analysis tools 
have been discussed in (24). Tools for identifying and correcting 
annotation anomalies, such as dubious protein 
product names, and for filling annotation gaps, such as 
genes that may have been missed by gene prediction tools 
or genes without predicted functions, are discussed in (26). 
IMG analysis tool extensions have addressed performance 
(30), data quality control, such as single cell data decontamination 
(31), and new data types, such as RNA-Seq 
and biosynthetic cluster data. 

Figure 2. Biosynthetic clusters. (i) Genomes associated with biosynthetic clusters can be retrieved and examined using the 'Genome Browser'. (ii) The 
number of biosynthetic clusters is provided in the 'Genome Statistics' section of the 'Organism Detail' page of a genome, together with a hyperlink to 
(iii) the list of biosynthetic clusters, whereby for each cluster the number of associated genes, the evidence type and the corresponding natural product 
are provided. (iv) A biosynthetic cluster can be examined using the 'Biosynthetic Cluster Detail' page, which includes information about the cluster. 
(v) 'Natural Product List' provides the list of the IMG genomes associated with natural products. 

Figure 3. RNA-Seq data exploration. (i) The list of RNA-Seq studies associated with a genome can be accessed from its 'Organism Details', with 
each study associated with (ii) a list of RNA-Seq experiments (samples). Individual samples can be selected for further analysis, such as 
(iii) examining their expressed genes as a list or using the (iv) chromosome viewer. A sample can also be examined in the context of (v) pathways 
that have at least one enzyme associated with an expressed gene in the sample, whereby for each pathway (vi) enzymes are displayed with colors 
representing the level of expression for the associated genes; mousing over an enzyme shows the number of expressed genes associated with the 
enzyme. 

RNA-Seq studies can be accessed from the 
'Experiments Statistics' section of the 'IMG Statistics' 
page or the 'Organism Details' pages of their associated 
genomes. For example, the 'Expression Studies' link in the 
'Organism Details' page, such as that shown in Figure 3(i), 
leads to the list of associated RNA-Seq samples, such as the list 
shown in Figure 3(ii). 



RNA-Seq studies associated with a genome can be 
compared using pairwise or multiple sample analysis 
tools as illustrated in Figure 4. After samples are 
selected for comparison (Figure 4(i)), pairs of samples 
can be compared in terms of up- or downregulation 
of genes, as illustrated in Figure 4(ii), with a threshold 
specified for the difference in gene expression. The 
difference in expression is computed using the 
logR = log2(query/reference) or the 
RelDiff = 2(query - reference)/(query + reference) metric. The comparison 
can be first previewed using a histogram, as illustrated in 
Figure 4(iii), which can help set the thresholds for the 
search of over-expressed or under-expressed genes 
between a pair of samples. The result of the comparison 
can be examined at the level of individual up- and 
downregulated genes, which can be selected for inclusion 
in the 'Gene Cart' for further analysis. Alternatively, the 
result of the comparison can be examined in terms of func- 
tions, as illustrated in Figure 4(iv), with genes associated 



Nucleic Acids Research, 2014, Vol. 42, Database issue D565 



[Figure 4 screenshots: RNA-Seq study page with the sample selection table (Sample ID, Sample Name, Genes with Reads, Total Reads Count, Average Reads per Gene); 'Pairwise Sample Analysis' options (reference sample, logR or RelDiff metric, threshold, 'Spearman's Rank Correlation', 'Linear Regression', 'Compare by Function'); 'Expression by Function for Selected Samples' table; 'Multiple Sample Analysis' options (clustering method, distance measure, 'Map Clusters to Pathways'); KEGG map view.] 

Figure 4. RNA-Seq data comparison. (i) RNA-Seq sample comparison starts with the selection of samples of interest. (ii) 'Pairwise Sample Analysis' 
supports comparing samples in terms of up/downregulated genes, with (iii) a histogram preview that helps set the thresholds for comparison. (iv) 
The result of the comparison can be examined in terms of functions, whereby genes associated with KEGG pathways or COG functions are grouped 
together. (v) The strength of the association of gene expression between pairs of samples can be examined using 'Spearman's Rank Correlation'. (vi) 
'Linear Regression' analysis helps estimate whether two samples are technical replicates. (vii) 'Multiple Sample Analysis' consists of clustering 
samples based on the abundance of expressed genes, using a variety of clustering methods. (viii) Clusters of samples can be examined in the 
context of pathways, whereby enzymes are displayed with colors representing the cluster. 



with KEGG pathways or COG functions grouped 
together. Genes associated with a specific KEGG 
pathway can be examined in the context of the pathway, 
similar to the example shown in Figure 3(vi) earlier. The 
strength of the association between pairs of samples can 
be examined using 'Spearman's Rank Correlation', as 
illustrated in Figure 4(v), whereas 'Linear Regression' 
analysis, illustrated in Figure 4(vi), helps determine 
whether two samples are technical replicates. 
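
The two pairwise metrics used for finding up- and downregulated genes, logR and RelDiff, can be written out as a short sketch. The formulas follow the definitions given above; the function names, the classify helper, its default threshold of 1 and the expression values are assumptions made for illustration, and the handling of genes with zero expression is left out.

```python
import math


def log_r(query, reference):
    """logR = log2(query / reference); both values must be positive."""
    return math.log2(query / reference)


def rel_diff(query, reference):
    """RelDiff = 2 * (query - reference) / (query + reference)."""
    return 2 * (query - reference) / (query + reference)


def classify(query, reference, threshold=1.0, metric=log_r):
    """Label a gene as 'up', 'down' or 'unchanged' (hypothetical helper)."""
    value = metric(query, reference)
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return "unchanged"


# Made-up normalized expression values for one gene in two samples.
print(log_r(8.0, 2.0), rel_diff(8.0, 2.0))
print(classify(8.0, 2.0), classify(2.0, 8.0), classify(5.0, 4.0))
```

Both metrics are antisymmetric, so swapping query and reference flips the sign. RelDiff is bounded between -2 and 2 for non-negative expression values, whereas logR is unbounded, which is one reason the histogram preview is useful for choosing a threshold.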

Multiple RNA-Seq sample analysis usually involves 
clustering based on the abundance of expressed genes, 
where the proximity of grouping indicates the relative 
degree of similarity of samples to each other. There is a 
choice of clustering methods, such as pairwise complete 
linkage and pairwise single linkage, and distance measure, 
such as Pearson correlation, Spearman's rank correlation 
and Euclidean distance, as illustrated in Figure 4(vii). The 
result of clustering is displayed as a hierarchical tree of 



samples and a normalized heat map of coverage values for 
each gene for each sample. Clusters of multiple samples 
can be also examined in the context of pathways, as 
illustrated in Figure 4(viii), whereby enzymes are 
displayed with colors representing the cluster. 
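
To make the clustering step concrete, the following sketch performs agglomerative clustering of samples with complete linkage and a Pearson-correlation-based distance, two of the options named above. It is a plain-Python illustration under stated assumptions, not the implementation used by IMG: the distance is taken as 1 minus the Pearson correlation, and the sample names and expression vectors are invented.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def complete_linkage(samples):
    """samples: {name: vector}. Returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in samples]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is that of the farthest pair.
                d = max(pearson_distance(samples[a], samples[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented per-gene expression values for three hypothetical samples.
samples = {
    "reference": [10, 20, 30, 40],
    "heat_1h": [12, 19, 33, 41],
    "heat_24h": [40, 30, 20, 10],
}
merges = complete_linkage(samples)
for cluster_a, cluster_b, distance in merges:
    print(cluster_a, cluster_b, round(distance, 3))
```

The order of the merges corresponds to the hierarchical tree of samples: samples merged early at a small distance have similar expression profiles, while the last merge joins the most dissimilar groups.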

FUTURE PLANS 

IMG's genome sequence data content is maintained 
through regular updates managed by the IMG submission 
system and involving new genomes sequenced at JGI, 
genomes sequenced at other organizations and submitted 
for inclusion into IMG by scientists worldwide and 
genomes from GenBank. For genomes with multiple submissions, 
only the latest version is kept in IMG. IMG 
genome data are distributed through genome data 
portals available at: http://genome.jgi.doe.gov/. IMG's 
data annotation and integration pipelines have been 
automated, thus improving their ability to keep pace with 
the rapidly increasing number of sequenced genomes. 

IMG's integrated data framework allows assessing and 
improving the quality of genome annotations. Thus, the 
quality of gene models for genomes available in public 
resources is known to vary greatly depending on the 
quality of sequence and the software used for annotation. 
An analysis conducted at JGI of the protein-coding genes 
of microbial genomes in GenBank indicates that ~10% (>1 
million) of predicted protein-coding genes are erroneous: they 
are false-positive genes, unidentified pseudogene fragments, 
genes with translational exceptions or genes with 
incorrectly predicted start sites. To improve the consistency 
of annotation and the quality of predicted genes, all 
public microbial genomes in IMG will be re-annotated 
using IMG's annotation pipeline. 

A rapidly increasing number of single cell genomes are 
included into IMG. Typically, the first version of a single 
cell genome is analyzed for identifying contigs that may 
come from contaminant (e.g. Pseudomonas, Ralstonia) or- 
ganisms. The sequence of analysis steps needed to identify 
and remove contaminated contigs is described in (31). 

The importance of functional genomics in validating 
gene function in an integrated comparative genomics 
context is also being underscored, pushing experimental 
data from methylomics and transposon mutagenesis 
experiments into IMG. Systematic paradigms for 
associating computationally predicted gene structural 
and functional information with experimental functional 
genomics are being constructed. Tools are being 
developed for mining and visualizing different types of 
Omics datasets in an integrated genomic context. 

IMG's users are faced with the increasing burden of 
analyzing a rapidly growing number of genomic 
datasets. This analytical challenge can be alleviated by 
synthesizing genomic data using the 'pangenome' concep- 
tual abstraction (32). A pangenome consists of the core 
part of a species (i.e. the genes present in all of the 
sequenced strains or of all samples of a microbial commu- 
nity) and the variable part (the genes present in some but 
not all of the strains or samples). An experimental version 
of IMG has been extended with five pangenomes, as well 
as analysis tools and viewers that allow users to explore 
individual pangenomes and compare pangenomes and 
genomes. A public version of IMG containing pangenome 
data and analysis tools is expected to be released in the 
near future. 
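
The core and variable parts of a pangenome, as defined above, map directly onto set operations. In the sketch below, each strain (or sample) is represented by the set of gene families it contains; the function name, strain names and family identifiers are placeholders chosen for illustration.

```python
def pangenome(strain_genes):
    """strain_genes: {strain: set of gene-family ids}. Returns (core, variable)."""
    gene_sets = list(strain_genes.values())
    # Core: families present in every strain; variable: present in some but not all.
    core = set.intersection(*gene_sets)
    variable = set.union(*gene_sets) - core
    return core, variable


# Placeholder strains and gene families.
strains = {
    "strain_A": {"fam1", "fam2", "fam3"},
    "strain_B": {"fam1", "fam2", "fam4"},
    "strain_C": {"fam1", "fam2"},
}
core, variable = pangenome(strains)
print(sorted(core), sorted(variable))
```

How genes from different strains are grouped into the same family (for example by sequence similarity or by protein family assignment) determines the result and is not specified here.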